4.2.5  Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   97
4.3    CP-NAS: Child-Parent Neural Architecture Search for 1-Bit CNNs . . .   98
4.3.1  Child-Parent Model for Network Binarization . . . . . . . . . . . .  100
4.3.2  Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . .  102
4.3.3  Search Strategy for CP-NAS . . . . . . . . . . . . . . . . . . . . .  103
4.3.4  Optimization of the 1-Bit CNNs . . . . . . . . . . . . . . . . . . .  103
4.3.5  Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . .  104
4.4    DCP-NAS: Discrepant Child-Parent Neural Architecture Search for
       1-Bit CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  105
4.4.1  Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . .  105
4.4.2  Redefined Child-Parent Framework for Network Binarization . . . . .  107
4.4.3  Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . .  108
4.4.4  Tangent Propagation for DCP-NAS . . . . . . . . . . . . . . . . . .  109
4.4.5  Generalized Gauss-Newton Matrix (GGN) for the Hessian Matrix . . . .  110
4.4.6  Decoupled Optimization for Training the DCP-NAS . . . . . . . . . .  111
4.4.7  Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . .  115
5      Applications in Natural Language Processing                           118
5.1    Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  118
5.1.1  Quantization-Aware Training (QAT) for Low-Bit Large Language
       Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  118
5.1.2  Post-Training Quantization (PTQ) for Low-Bit Large Language
       Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  118
5.1.3  Binary BERT Pre-Trained Models . . . . . . . . . . . . . . . . . . .  119
5.2    Fully Quantized Transformer for Machine Translation . . . . . . . .  121
5.2.1  Quantization Scheme . . . . . . . . . . . . . . . . . . . . . . . .  121
5.2.2  What to Quantize . . . . . . . . . . . . . . . . . . . . . . . . . .  122
5.2.3  Tensor Bucketing . . . . . . . . . . . . . . . . . . . . . . . . . .  123
5.2.4  Dealing with Zeros . . . . . . . . . . . . . . . . . . . . . . . . .  124
5.3    Q-BERT: Hessian-Based Ultra Low-Precision Quantization of BERT . . .  125
5.3.1  Hessian-Based Mixed-Precision . . . . . . . . . . . . . . . . . . .  125
5.3.2  Group-Wise Quantization . . . . . . . . . . . . . . . . . . . . . .  125
5.4    I-BERT: Integer-Only BERT Quantization . . . . . . . . . . . . . . .  127
5.4.1  Integer-Only Computation of GELU and Softmax . . . . . . . . . . . .  128
5.4.2  Integer-Only Computation of LayerNorm . . . . . . . . . . . . . . .  128
5.5    Toward Efficient Post-Training Quantization of Pre-Trained Language
       Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  129
5.5.1  Module-Wise Reconstruction Error Minimization . . . . . . . . . . .  129
5.5.2  Model Parallel Strategy . . . . . . . . . . . . . . . . . . . . . .  130
5.5.3  Annealed Teacher Forcing . . . . . . . . . . . . . . . . . . . . . .  130
5.6    Outlier Suppression: Pushing the Limit of Low-Bit Transformer
       Language Models . . . . . . . . . . . . . . . . . . . . . . . . . .  132
5.6.1  Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  132
5.6.2  Gamma Migration . . . . . . . . . . . . . . . . . . . . . . . . . .  133
5.6.3  Token-Wise Clipping . . . . . . . . . . . . . . . . . . . . . . . .  134
5.7    BinaryBERT: Pushing the Limit of BERT Quantization . . . . . . . . .  134
5.7.1  Ternary Weight Splitting . . . . . . . . . . . . . . . . . . . . . .  136
5.7.2  Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . . .  136
5.8    BEBERT: Efficient and Robust Binary Ensemble BERT . . . . . . . . .  138
5.9    BiBERT: Accurate Fully Binarized BERT . . . . . . . . . . . . . . .  139
5.9.1  Bi-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . .  139